12 research outputs found

    Toy Models of Superposition

    Full text link
    Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability. Comment: Also available at https://transformer-circuits.pub/2022/toy_model/index.htm
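    To make the setup concrete, below is a minimal sketch of the kind of toy model the abstract describes: more sparse features than hidden dimensions, reconstructed through a tied linear map with a ReLU. The dimensions, sparsity level, learning rate, and training loop are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

n_features, n_hidden = 20, 5   # more features than hidden dimensions
sparsity = 0.05                # probability that any given feature is active

# Tied-weight toy model: x_hat = ReLU(W^T W x + b)
W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5000):
    # Synthetic sparse features: each feature is active independently with
    # low probability, with a uniform random magnitude when active.
    active = (torch.rand(1024, n_features) < sparsity).float()
    x = active * torch.rand(1024, n_features)

    x_hat = torch.relu(x @ W.T @ W + b)   # reconstruct through the bottleneck
    loss = ((x - x_hat) ** 2).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

# Off-diagonal structure in the feature Gram matrix W^T W indicates features
# stored non-orthogonally, i.e. in superposition.
print(torch.round(W.detach().T @ W.detach(), decimals=2))
```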

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Full text link
    We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.
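    Of the four model types above, "rejection sampling" is the one most easily illustrated in isolation: draw several candidate replies and keep the one a scoring model ranks highest. The sketch below is a generic best-of-k version of that idea with hypothetical `sample` and `score` callables, not the authors' actual pipeline.

```python
# Generic best-of-k rejection sampling: draw k candidate completions and return
# the one a scoring model ranks highest. `sample` and `score` are hypothetical
# stand-ins for an LM sampler and a preference/harmlessness model; this is a
# sketch of the general technique, not the paper's implementation.
def rejection_sample(prompt, sample, score, k=16):
    candidates = [sample(prompt) for _ in range(k)]
    return max(candidates, key=lambda reply: score(prompt, reply))
```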

    Language Models (Mostly) Know What They Know

    Full text link
    We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing. Comment: 23+17 pages; refs added, typos fixed
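    As a rough illustration of the P(True) procedure described above, the sketch below first asks the model to propose an answer and then scores how likely it is to label that proposal as true. The prompt wording and the `sample`/`logprob` callables are assumptions standing in for whatever LM interface is available, not the paper's exact template.

```python
import math

def p_true(question: str, sample, logprob) -> float:
    """Estimate the model's probability that its own proposed answer is correct."""
    # Step 1: have the model propose an answer by open-ended sampling.
    proposed = sample(f"Question: {question}\nAnswer:")

    # Step 2: ask the model to judge its own proposal.
    eval_prompt = (
        f"Question: {question}\n"
        f"Proposed Answer: {proposed}\n"
        "Is the proposed answer:\n"
        " (A) True\n"
        " (B) False\n"
        "The proposed answer is:"
    )
    # Compare the model's log-probabilities for the two labels (stable softmax).
    lp_true = logprob(eval_prompt, " (A) True")
    lp_false = logprob(eval_prompt, " (B) False")
    m = max(lp_true, lp_false)
    return math.exp(lp_true - m) / (math.exp(lp_true - m) + math.exp(lp_false - m))
```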

    Specific versus General Principles for Constitutional AI

    Full text link
    Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as "do what's best for humanity". We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.

    Scaling Laws and Interpretability of Learning from Repeated Data

    Full text link
    Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can lead test loss to increase midway through training. A predictable range of repetition frequency leads to surprisingly severe degradation in performance. For instance, performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model's capacity, and this may be where the peak of degradation occurs. Finally, we connect these observations to recent mechanistic interpretability work - attempting to reverse engineer the detailed computations performed by the model - by showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Taken together, these results provide a hypothesis for why repeating a relatively small fraction of data in large language models could lead to disproportionately large harms to performance. Comment: 23 pages, 22 figures
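    The 0.1% / 100x example above is easy to check with a little arithmetic: the repeated slice contributes its fraction of the corpus times the number of repeats, so about 9% of the tokens seen in training are repeats and roughly 90% remain unique. The helper below is just that calculation, not anything from the paper's codebase.

```python
# Back-of-the-envelope check of the 0.1% / 100x example above: what fraction of
# the tokens seen during training come from the repeated subset? Illustrative
# helper only; the accounting convention is an assumption about how the mixture
# is described, not code from the paper.
def repeated_token_share(fraction_repeated: float, n_repeats: int) -> float:
    repeated_tokens = fraction_repeated * n_repeats   # repeated slice, counted with multiplicity
    unique_tokens = 1.0 - fraction_repeated           # the rest of the corpus, seen once
    return repeated_tokens / (repeated_tokens + unique_tokens)

print(repeated_token_share(0.001, 100))  # ~0.091, i.e. ~90% of training tokens stay unique
```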

    Overexpression of arginase alters circulating and tissue amino acids and guanidino compounds and affects neuromotor behavior in mice

    No full text
    Arginine is an intermediate of the ornithine cycle and serves as a precursor for the synthesis of nitric oxide, creatine, agmatine and proteins. It is considered to be a conditionally essential amino acid because endogenous synthesis only barely meets daily requirements. In rapidly growing suckling neonates, endogenous arginine biosynthesis is crucial to compensate for the insufficient supply of arginine via the milk. Evidence is accumulating that the intestine rather than the kidney plays a major role in arginine synthesis in this period. Accordingly, ectopic expression of hepatic arginase in murine enterocytes by genetic modification induces a selective arginine deficiency. The ensuing phenotype, whose severity correlates with the level of transgene expression in the enterocytes, could be reversed with arginine supplementation. We analyzed the effect of arginine deficiency on guanidine metabolism and neuromotor behavior. Arginine-deficient transgenic mice continued to suffer from an arginine deficiency after the arginine biosynthetic enzymes had disappeared from the enterocytes. Postweaning catch-up growth in arginine-deficient mice was characterized by increased levels of all measured amino acids except arginine. Furthermore, plasma total amino acid concentration, including arginine, was significantly lower in adult male than in adult female transgenic mice. Decreases in the concentration of plasma and tissue arginine led to significant decreases in most metabolites of arginine. However, the accumulation of the toxic guanidino compounds, guanidinosuccinic acid and methylguanidine, corresponded inversely with circulating arginine concentration, possibly reflecting higher oxidative stress under hypoargininemic conditions. In addition, hypoargininemia was associated with disturbed neuromotor behavior, although brain levels of toxic guanidino compounds and ammonia were normal.